Welcome!

In these lectures

Week 1

  • Statistical basics
    • Mean, median and mode (what is an average?)
    • Standard deviation (how spread out are data?)
  • Statistical graphics
    • Histograms
    • Scatter plots
    • Lines of best fit

Week 2

  • Binomial distribution (many coin flips)
  • Bernoulli distribution (one coin flip)
  • Normal distribution (the bell curve)

Week 3

  • Linear regression (fitting a line)
  • Logistic regression (predicting categories)

Who am I?

My story

  • grew up in Sydney
  • undergrad and Honours at Sydney Uni, 2009-2012
    • in physics
  • exchange at the University of California, Berkeley (near San Francisco)
    • you should consider exchange too!
  • Doctorate at Oxford 2013-2017
    • detecting planets with Kepler
  • Postdoc at New York University 2017-2020
    • in data science and physics departments
    • more Kepler stuff, imaging planets, radio stars
  • Lecturer at the University of Queensland 2021-2024
  • Just started here - be patient!
  • Don’t hesitate to ask anything - about stats, careers, or uni

A/Prof Benjamin Pope (me)

Why is an astronomer a statistician?

In my research, I want to

  • learn how stars work
  • detect planets around them
  • develop technology for doing this better

James Webb Space Telescope

All of these problems are data analysis problems.

What do we expect?

  • you may or may not have a stats background - this is fine!
    • stats is data science and data science is stats - you want to learn this material well
  • you may not have much experience with Python or Git - put work into getting good at these!
  • come to lectures & meet your peers - this is really valuable!
  • ask questions on the online discussion forum first, and feel free to email too - but it’s great if we can answer a question lots of people might have

What do we mean by statistics?

What is data science?

It’s stats!

What do you mean by statistics?

  • the Census?
  • spreadsheets?
  • averages and standard deviations?
  • probability theory?
  • All of these things!

What do I mean by statistics?

Statistics is the science of reasoning about uncertainty.

Data are always noisy and incomplete, and the art here is in properly accounting for this and getting reliable, accurate, precise results.

We’ll study how to

  • gather data;
  • visualize data;
  • summarize data;
  • fit models to data;
  • interpret the models;
  • and make decisions.

Today

Let’s start with how to gather, visualize and summarize data.

Public Datasets

There are a lot of data available from public sources. One is the Australian Bureau of Statistics; we may use some of its datasets later.

But another - which might help you get a job! - is Kaggle, for data science competitions. They host public datasets and you can compete to produce the best models to explain and predict them.

House Price Data

Let’s start with something that’s probably on everybody’s minds - a Kaggle dataset on Sydney house prices.

This comes as a .csv file: comma separated values. It looks like this:

price,date_sold,suburb,num_bath,num_bed,num_parking,property_size,type,suburb_population,suburb_median_income,suburb_sqkm,suburb_lat,suburb_lng,suburb_elevation,cash_rate,property_inflation_index,km_from_cbd
530000,13/1/16,Kincumber,4,4,2,1351,House,7093,29432,9.914,-33.47252,151.40208,24,2,150.9,47.05
525000,13/1/16,Halekulani,2,4,2,594,House,2538,24752,1.397,-33.21772,151.55237,23,2,150.9,78.54
480000,13/1/16,Chittaway Bay,2,4,2,468,House,2028,31668,1.116,-33.32678,151.44557,3,2,150.9,63.59
452000,13/1/16,Leumeah,1,3,1,344,House,9835,32292,4.055,-34.05375,150.83957,81,2,150.9,40.12
365500,13/1/16,North Avoca,0,0,0,1850,Vacant land,2200,45084,1.497,-33.45608,151.43598,18,2,150.9,49.98
...

Loading a CSV

We’re going to use pandas, an open source package for loading tables in Python; you’ll use this a lot!

import pandas as pd

df = pd.read_csv('./domain_properties.csv') # df stands for DataFrame
print(df)
         price date_sold         suburb  num_bath  num_bed  num_parking  \
0       530000   13/1/16      Kincumber         4        4            2   
1       525000   13/1/16     Halekulani         2        4            2   
2       480000   13/1/16  Chittaway Bay         2        4            2   
3       452000   13/1/16        Leumeah         1        3            1   
4       365500   13/1/16    North Avoca         0        0            0   
...        ...       ...            ...       ...      ...          ...   
11155  1900000  31/12/21     Kellyville         3        4            2   
11156  1300000  31/12/21    Seven Hills         3        7            2   
11157  1025000  31/12/21         Sydney         2        2            1   
11158  1087500    1/1/22       Prestons         2        4            2   
11159  1000000    1/1/22       Ourimbah         2        3            2   

       property_size                     type  suburb_population  \
0               1351                    House               7093   
1                594                    House               2538   
2                468                    House               2028   
3                344                    House               9835   
4               1850              Vacant land               2200   
...              ...                      ...                ...   
11155            540                    House              27971   
11156           1208                    House              19326   
11157            129  Apartment / Unit / Flat              17252   
11158            384                    House              15313   
11159            667                    House               3951   

       suburb_median_income  suburb_sqkm  suburb_lat  suburb_lng  \
0                     29432        9.914   -33.47252   151.40208   
1                     24752        1.397   -33.21772   151.55237   
2                     31668        1.116   -33.32678   151.44557   
3                     32292        4.055   -34.05375   150.83957   
4                     45084        1.497   -33.45608   151.43598   
...                     ...          ...         ...         ...   
11155                 46228       18.645   -33.69583   150.95622   
11156                 33540        9.629   -33.77743   150.94272   
11157                 35412        2.940   -33.86794   151.20998   
11158                 36244        9.215   -33.94155   150.87334   
11159                 37180       87.154   -33.31517   151.32611   

       suburb_elevation  cash_rate  property_inflation_index  km_from_cbd  
0                    24        2.0                     150.9        47.05  
1                    23        2.0                     150.9        78.54  
2                     3        2.0                     150.9        63.59  
3                    81        2.0                     150.9        40.12  
4                    18        2.0                     150.9        49.98  
...                 ...        ...                       ...          ...  
11155                78        0.1                     220.1        30.08  
11156                38        0.1                     220.1        26.58  
11157                65        0.1                     220.1         0.31  
11158                28        0.1                     220.1        32.26  
11159               191        0.1                     220.1        61.95  

[11160 rows x 17 columns]
So we see that a data frame has columns, each of which corresponds to some property of the data points, like price, suburb, etc. Every individual house sold is a row in this table.
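To get a feel for this, here is a minimal sketch of selecting a column and a row, using a tiny hand-made frame with made-up numbers in place of the full CSV:

```python
import pandas as pd

# a tiny hand-made frame mimicking a few columns of the dataset (made-up values)
df = pd.DataFrame({
    'price': [530000, 525000, 480000],
    'suburb': ['Kincumber', 'Halekulani', 'Chittaway Bay'],
})

prices = df['price']   # a single column is a pandas Series
print(prices.mean())   # columns support numeric methods directly
print(df.loc[0])       # .loc selects a row by its index label
```

The square-bracket syntax `df['price']` works for any column name; `df.price` is a convenient shorthand, but only when the name is a valid Python identifier.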

Selecting Data

Let’s look at only those data from 2021:

df['year'] = pd.to_datetime(df.date_sold, dayfirst=True).dt.year # dates in the csv are strings, so parse them and extract the year
df21 = df[df.year==2021]
print(df21)
         price date_sold              suburb  num_bath  num_bed  num_parking  \
5830   2800000   13/1/21           Wagstaffe         3        4            3   
5831   1315000   13/1/21  The Entrance North         2        4            2   
5832   2640000   14/1/21            Clovelly         3        5            0   
5833   1825000   14/1/21          Willoughby         2        4            1   
5834   1200000   14/1/21      Hamlyn Terrace         3        5            5   
...        ...       ...                 ...       ...      ...          ...   
11153  2055000  30/12/21         Carlingford         2        5            2   
11154  1900000  30/12/21              Mascot         3        4            2   
11155  1900000  31/12/21          Kellyville         3        4            2   
11156  1300000  31/12/21         Seven Hills         3        7            2   
11157  1025000  31/12/21              Sydney         2        2            1   

       property_size                     type  suburb_population  \
5830             866                    House                222   
5831             557                    House               1474   
5832             234                    House               4736   
5833             260                    House               6540   
5834            1499                    House               6069   
...              ...                      ...                ...   
11153            689                    House              24394   
11154            285                    House              14772   
11155            540                    House              27971   
11156           1208                    House              19326   
11157            129  Apartment / Unit / Flat              17252   

       suburb_median_income  suburb_sqkm  suburb_lat  suburb_lng  \
5830                  41964        0.398   -33.52349   151.34031   
5831                  31148        1.018   -33.33333   151.50775   
5832                  67236        0.786   -33.91153   151.26289   
5833                  55692        1.628   -33.80664   151.20155   
5834                  29952        5.215   -33.25323   151.47277   
...                     ...          ...         ...         ...   
11153                 33696        8.528   -33.77523   151.04540   
11154                 41912       11.961   -33.94660   151.18371   
11155                 46228       18.645   -33.69583   150.95622   
11156                 33540        9.629   -33.77743   150.94272   
11157                 35412        2.940   -33.86794   151.20998   

       suburb_elevation  cash_rate  property_inflation_index  km_from_cbd  
5830                 47        0.1                     183.1        39.78  
5831                  0        0.1                     183.1        65.14  
5832                 28        0.1                     183.1         7.11  
5833                 90        0.1                     183.1         6.53  
5834                 17        0.1                     183.1        72.13  
...                 ...        ...                       ...          ...  
11153               102        0.1                     220.1        18.20  
11154                 3        0.1                     220.1         9.35  
11155                78        0.1                     220.1        30.08  
11156                38        0.1                     220.1        26.58  
11157                65        0.1                     220.1         0.31  

[5328 rows x 17 columns]
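The boolean-mask selection above has an equivalent string syntax, `.query`, which some people find more readable. A minimal sketch with a made-up miniature of the table:

```python
import pandas as pd

# made-up miniature of the price table (hypothetical numbers)
df = pd.DataFrame({'year': [2016, 2021, 2021],
                   'price': [530000, 2800000, 1315000]})

masked = df[df.year == 2021]        # boolean-mask selection, as above
queried = df.query('year == 2021')  # .query: an equivalent string syntax

print(masked.equals(queried))  # True - the two give identical frames
print(len(masked))             # 2 rows survive the filter
```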

Histograms

Now let’s make the most basic visualization of a dataset - a histogram.

You should almost always do this!

We are going to use another package you are going to learn inside and out: matplotlib.

import matplotlib.pyplot as plt # this makes plots in python
plt.hist(df21.price/1e6,bins=100); # semicolon to suppress output; /1e6 to make readable

plt.xlabel('House Price ($M)') # always label your axes!
plt.ylabel('Number of Houses');

Filtering Data

There are some expensive houses in Sydney! Let’s look at the lower end:

realistic = df21[df21.price < 5e6]
plt.hist(realistic.price/1e6,bins=100); # semicolon to suppress output; /1e6 to make readable

plt.xlabel('House Price ($M)') # always label your axes!
plt.ylabel('Number of Houses');

Summary Statistics

Let’s talk about the mean, the median and percentiles, and the mode as ways of talking about a distribution.

The mean is defined as \[ \langle{x}\rangle \equiv \frac{1}{N} \sum_{i=1}^{N} x_i \]

i.e. this is the contribution to the total, per item.

The median is the value of \(x\) such that 50% of samples are higher and 50% are lower: i.e. the middle of the distribution. More generally, a percentile is defined so that (say) 90% of samples are less than the 90th percentile.

The mode is the most common value.

NumPy

For doing maths like this on data, we want to use numpy, the standard Python package for numerical calculations:

import numpy as np # always as np!
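A quick worked example of these definitions on a tiny made-up sample:

```python
import numpy as np

x = np.array([1, 2, 2, 3, 10])  # a tiny made-up sample with one outlier

print(np.mean(x))            # 3.6 - pulled upwards by the outlier 10
print(np.median(x))          # 2.0 - the middle value, robust to the outlier
print(np.percentile(x, 90))  # 90% of samples lie below this value
```

Note how the mean is dragged towards the outlier while the median is not: this is exactly why house prices are usually reported as medians.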

Summary Statistics

Let’s plot the same histogram as above, but showing the summary statistics.

h = plt.hist(realistic.price/1e6,bins=100); # semicolon to suppress output; /1e6 to make readable

plt.xlabel('House Price ($M)') # always label your axes!
plt.ylabel('Number of Houses')

plt.axvline(np.mean(realistic.price/1e6),
    ls='-',lw=5,color='C0',label = 'Mean')
plt.axvline(np.median(realistic.price/1e6),
    ls=':',lw=5,color='C1',label = 'Median')

for percentile in [10, 90]: # this is a for loop for doing multiple things
    plt.axvline(np.percentile(realistic.price/1e6,percentile),
            ls=':',color='C1', lw=3,label = f'{percentile}th Percentile') # this is an f-string for printing things

mode_bin = np.argmax(h[0]) # index of the tallest bin
mode_price = h[1][mode_bin] # left edge of that bin approximates the mode
plt.axvline(mode_price, ls='--', color='C2',lw=5, label='Mode')

plt.legend()

Relationships

The core thing we want to do in data science is to make inferences from data. This means finding relationships in data to help us predict or understand what is happening.

Side note: infer vs imply. What’s the difference?

Data imply things to us.
We infer things from data.

Trend Lines

Let’s see if we can plot a trend line for prices over time:

years = np.arange(2016,2022,1) # 2016 through 2021 inclusive - arange stops before 2022
means, lowers, uppers = [], [], [] # init empty lists

for year in years: 
    thisdf = df[df.year==year]
    means.append(np.mean(thisdf.price))
    lowers.append(np.percentile(thisdf.price,25))
    uppers.append(np.percentile(thisdf.price,75))
# make these into arrays
means, lowers, uppers = np.array(means), np.array(lowers), np.array(uppers)
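As an aside, the loop above can also be written in a few lines with pandas' groupby; a minimal sketch on a made-up miniature of the table:

```python
import pandas as pd

# made-up miniature of the price table (hypothetical numbers)
df = pd.DataFrame({'year': [2016, 2016, 2017, 2017],
                   'price': [5e5, 7e5, 9e5, 1.1e6]})

grouped = df.groupby('year')['price']
means = grouped.mean()           # one mean per year
lowers = grouped.quantile(0.25)  # 25th percentile per year
uppers = grouped.quantile(0.75)  # 75th percentile per year
print(means)
```

Both approaches give the same numbers; the explicit loop is easier to read when you're starting out, and groupby is the idiom you'll grow into.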

Plot this

Now we can see a (depressing) trend with time:

plt.plot(years, means/1e6, 'C2--',label='Mean Price')
plt.fill_between(years, lowers/1e6, uppers/1e6, color='C2', alpha=0.2,label='25-75 percentile')

plt.ylabel('Price ($M)')
plt.xlabel('Year')
plt.xticks(years)
plt.title('Sydney House Prices in 2016-2021')
plt.xlim(years.min(),years.max())
plt.legend(loc='upper left');

Scatter Plots

What if we want to see how multiple things relate, not just time?

We can use a scatter plot, in which each individual data point is rendered as a dot on whatever axes we like. Let’s see how property size relates to price:

houses = realistic[realistic.type == 'House']

plt.scatter(houses['property_size'], houses['price']/1e6, alpha=0.5, color='C2',label='House')

plt.xlim(0,750)
plt.xlabel('Property Size (sqm)')
plt.ylabel('Price ($M)')
plt.legend();

Multiple Datasets

We can do the same comparison for apartments and terraces and overlay them:
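The overlay is just one scatter call per property type on the same axes. A sketch of how that might look, using a small made-up stand-in for the `realistic` frame (the type names match the dataset; the numbers are hypothetical):

```python
import matplotlib
matplotlib.use('Agg')  # render off-screen; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# made-up stand-in for the `realistic` frame from the earlier slide
realistic = pd.DataFrame({
    'type': ['House', 'House', 'Apartment / Unit / Flat', 'Terrace'],
    'property_size': [540, 344, 129, 180],
    'price': [1.9e6, 0.45e6, 1.0e6, 1.4e6],
})

# one scatter call per property type, overlaid on the same axes
for prop_type, colour in [('House', 'C2'),
                          ('Apartment / Unit / Flat', 'C0'),
                          ('Terrace', 'C1')]:
    subset = realistic[realistic.type == prop_type]
    plt.scatter(subset['property_size'], subset['price']/1e6,
                alpha=0.5, color=colour, label=prop_type)

plt.xlabel('Property Size (sqm)')
plt.ylabel('Price ($M)')
plt.legend();
```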

Coloured Scatter Plots

We don’t have to be restricted to just representing 2 dimensions: we can even put a colour map on the data to represent a third quantity!

houses = houses.sort_values('km_from_cbd') # avoid sorting a slice in place, which pandas warns about
plt.scatter(houses['property_size'], houses['price']/1e6, alpha=0.8, s=4,c=houses['km_from_cbd'],
            label='House',cmap='inferno')
plt.colorbar(label='Distance from CBD (km)')
plt.xlim(0,2000)
plt.xlabel('Property Size (sqm)')
plt.ylabel('Price ($M)')
plt.title('Freestanding House Prices');

Fitting a Line to Data

The most important thing in your whole job will be learning from data: finding a mathematical representation that explains current data and predicts new data.

This can be as simple as fitting a line.

In the next lectures, we’re going to learn how to do this: but let’s see what we mean by that!
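As a preview, here is a minimal sketch of fitting a line by least squares with numpy. The data here are entirely made up (lecture length vs student sleepiness, in the spirit of the demo plot):

```python
import numpy as np

# entirely made-up data: lecture length (minutes) vs student sleepiness (0-10)
minutes = np.array([10, 20, 30, 40, 50, 60])
sleepiness = np.array([1.0, 2.2, 2.8, 4.1, 5.0, 5.9])

# fit a straight line y = m*x + c by least squares
m, c = np.polyfit(minutes, sleepiness, deg=1)
print(f'slope = {m:.3f} sleepiness/minute, intercept = {c:.3f}')
```

We'll unpack what "least squares" actually means, and when it's the right choice, in the coming lectures.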


I think that’s time to end the lecture!

Recap

Today we’ve learned about:

  • loading data in Python
  • summary statistics
  • histograms
  • scatter plots

Next week we’re going to learn:

  • linear regression
  • distributions:
    • normal
    • uniform
    • coin-toss